Skip to content

18630: Remove FilterExec from CoalesceBatches optimization rule#15

Open
martin-augment wants to merge 3 commits intomainfrom
pr-18630-2025-11-12-09-51-01
Open

18630: Remove FilterExec from CoalesceBatches optimization rule#15
martin-augment wants to merge 3 commits intomainfrom
pr-18630-2025-11-12-09-51-01

Conversation

@martin-augment
Copy link
Copy Markdown
Owner

18630: To review by AI

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Nov 12, 2025

Walkthrough

This pull request removes CoalesceBatchesExec wrapper optimization around FilterExec operators from the physical query planner. The core change occurs in the coalesce batches optimizer, which now only wraps HashJoinExec instead of both HashJoinExec and FilterExec. This results in simplified execution plans across numerous SQL queries and test cases, where FilterExec operators now appear directly in the plan without an intermediate coalescing batch stage, while the underlying filter logic and data source operations remain functionally equivalent.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch pr-18630-2025-11-12-09-51-01

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@claude
Copy link
Copy Markdown

claude bot commented Nov 12, 2025

Code Review: Remove FilterExec from CoalesceBatches optimization rule

Summary

This PR removes FilterExec from the list of execution nodes that are automatically wrapped with CoalesceBatchesExec in the physical optimizer.

Positive Aspects

  1. Well-justified change: FilterExec already has built-in batch coalescing via BatchCoalescer (line 412 in filter.rs), making the external CoalesceBatchesExec wrapper redundant.

  2. Reduces plan complexity: Removing unnecessary CoalesceBatchesExec nodes simplifies physical execution plans.

  3. Code quality: The change is minimal, focused, and follows the single responsibility principle.

  4. Comprehensive test updates: All affected test expectations have been updated across 57 test files.

Review Findings

Code Quality: Good

  • The implementation correctly removes the FilterExec import and the corresponding check.
  • Comment on line 60-62 still accurately describes the rule purpose.

Potential Issues: Minor Concern

  • Missing documentation update: The comment on lines 35-36 mentions highly selective filters as a use case, but this PR removes filter handling. Consider updating the comment.

Performance: Excellent

  • Positive impact: Eliminates double coalescing overhead since FilterExec uses BatchCoalescer internally.
  • No regression risk: The built-in coalescing provides the same functionality.

Security: None identified

Test Coverage: Excellent

  • 57 test files updated with new expected plans
  • All updates correctly reflect the removal of CoalesceBatchesExec wrapping

Recommendations

  1. Update the rule comment (lines 35-36) to remove the reference to highly selective filters.

  2. Consider adding a code comment explaining why FilterExec is NOT in the list of wrapped operators.

  3. Optional: Add a test that explicitly verifies FilterExec is NOT wrapped.

Conclusion

Recommendation: Approve with minor documentation improvements

This is a solid optimization that removes redundant functionality. The implementation is correct, well-tested, and should improve performance by eliminating double batch coalescing.

Impact: Low risk, positive performance impact, improved plan readability.

Copy link
Copy Markdown

@augmentcode augmentcode bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review completed. 1 suggestion posted.

Comment augment review to trigger a new review at any time.

// See https://github.com/apache/datafusion/issues/139
let wrap_in_coalesce = plan_any.downcast_ref::<FilterExec>().is_some()
|| plan_any.downcast_ref::<HashJoinExec>().is_some()
let wrap_in_coalesce = plan_any.downcast_ref::<HashJoinExec>().is_some()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doc comment above still mentions handling “small batches that are produced by highly selective filters,” but the implementation no longer wraps FilterExec. Consider updating that comment to reflect the current behavior (HashJoin/Repartition and AsyncFuncExec) to keep docs accurate.

🤖 Was this useful? React with 👍 or 👎

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:good-but-wont-fix; category:documentation; feedback:The Augment AI reviewer is correct that the comment mentions "filter" but the result could be filtered not only by FilterExec. HashJoinExec also filters the results by returning only those matching the join condition.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
datafusion/physical-optimizer/src/coalesce_batches.rs (1)

35-37: Update the documentation to reflect the removal of FilterExec.

The comment mentions "highly selective filters" as a motivation for this optimizer rule, but the rule no longer wraps FilterExec operators. Consider updating this comment to reflect the current scope of the optimizer.

Apply this diff to update the documentation:

-/// Optimizer rule that introduces CoalesceBatchesExec to avoid overhead with small batches that
-/// are produced by highly selective filters
+/// Optimizer rule that introduces CoalesceBatchesExec to avoid overhead with small batches
datafusion/core/tests/dataframe/mod.rs (1)

769-771: Reduce snapshot brittleness around plan shapes (optional)

Consider asserting for the presence/ordering of key nodes (AggregateExec → FilterExec → DataSourceExec) instead of full multiline snapshots to lessen churn when display wrappers change.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cc49fc0 and 44e760d.

📒 Files selected for processing (58)
  • datafusion/core/tests/dataframe/mod.rs (5 hunks)
  • datafusion/core/tests/sql/explain_analyze.rs (2 hunks)
  • datafusion/physical-optimizer/src/coalesce_batches.rs (2 hunks)
  • datafusion/sqllogictest/test_files/aggregate.slt (2 hunks)
  • datafusion/sqllogictest/test_files/array.slt (6 hunks)
  • datafusion/sqllogictest/test_files/async_udf.slt (1 hunks)
  • datafusion/sqllogictest/test_files/count_star_rule.slt (1 hunks)
  • datafusion/sqllogictest/test_files/cte.slt (5 hunks)
  • datafusion/sqllogictest/test_files/dictionary.slt (3 hunks)
  • datafusion/sqllogictest/test_files/explain.slt (1 hunks)
  • datafusion/sqllogictest/test_files/explain_tree.slt (17 hunks)
  • datafusion/sqllogictest/test_files/filter_without_sort_exec.slt (6 hunks)
  • datafusion/sqllogictest/test_files/join.slt.part (3 hunks)
  • datafusion/sqllogictest/test_files/join_disable_repartition_joins.slt (1 hunks)
  • datafusion/sqllogictest/test_files/join_is_not_distinct_from.slt (1 hunks)
  • datafusion/sqllogictest/test_files/joins.slt (8 hunks)
  • datafusion/sqllogictest/test_files/limit.slt (1 hunks)
  • datafusion/sqllogictest/test_files/map.slt (1 hunks)
  • datafusion/sqllogictest/test_files/operator.slt (2 hunks)
  • datafusion/sqllogictest/test_files/options.slt (2 hunks)
  • datafusion/sqllogictest/test_files/parquet.slt (4 hunks)
  • datafusion/sqllogictest/test_files/parquet_filter_pushdown.slt (6 hunks)
  • datafusion/sqllogictest/test_files/parquet_statistics.slt (3 hunks)
  • datafusion/sqllogictest/test_files/predicates.slt (5 hunks)
  • datafusion/sqllogictest/test_files/projection.slt (1 hunks)
  • datafusion/sqllogictest/test_files/push_down_filter.slt (6 hunks)
  • datafusion/sqllogictest/test_files/pwmj.slt (5 hunks)
  • datafusion/sqllogictest/test_files/qualify.slt (4 hunks)
  • datafusion/sqllogictest/test_files/regexp/regexp_like.slt (1 hunks)
  • datafusion/sqllogictest/test_files/repartition.slt (1 hunks)
  • datafusion/sqllogictest/test_files/repartition_scan.slt (6 hunks)
  • datafusion/sqllogictest/test_files/select.slt (6 hunks)
  • datafusion/sqllogictest/test_files/simplify_expr.slt (4 hunks)
  • datafusion/sqllogictest/test_files/subquery.slt (2 hunks)
  • datafusion/sqllogictest/test_files/tpch/plans/q1.slt.part (1 hunks)
  • datafusion/sqllogictest/test_files/tpch/plans/q10.slt.part (1 hunks)
  • datafusion/sqllogictest/test_files/tpch/plans/q11.slt.part (1 hunks)
  • datafusion/sqllogictest/test_files/tpch/plans/q12.slt.part (1 hunks)
  • datafusion/sqllogictest/test_files/tpch/plans/q13.slt.part (1 hunks)
  • datafusion/sqllogictest/test_files/tpch/plans/q14.slt.part (1 hunks)
  • datafusion/sqllogictest/test_files/tpch/plans/q15.slt.part (1 hunks)
  • datafusion/sqllogictest/test_files/tpch/plans/q16.slt.part (1 hunks)
  • datafusion/sqllogictest/test_files/tpch/plans/q17.slt.part (1 hunks)
  • datafusion/sqllogictest/test_files/tpch/plans/q18.slt.part (1 hunks)
  • datafusion/sqllogictest/test_files/tpch/plans/q19.slt.part (1 hunks)
  • datafusion/sqllogictest/test_files/tpch/plans/q2.slt.part (1 hunks)
  • datafusion/sqllogictest/test_files/tpch/plans/q20.slt.part (1 hunks)
  • datafusion/sqllogictest/test_files/tpch/plans/q21.slt.part (1 hunks)
  • datafusion/sqllogictest/test_files/tpch/plans/q22.slt.part (1 hunks)
  • datafusion/sqllogictest/test_files/tpch/plans/q3.slt.part (1 hunks)
  • datafusion/sqllogictest/test_files/tpch/plans/q4.slt.part (1 hunks)
  • datafusion/sqllogictest/test_files/tpch/plans/q5.slt.part (1 hunks)
  • datafusion/sqllogictest/test_files/tpch/plans/q6.slt.part (1 hunks)
  • datafusion/sqllogictest/test_files/tpch/plans/q7.slt.part (1 hunks)
  • datafusion/sqllogictest/test_files/tpch/plans/q8.slt.part (1 hunks)
  • datafusion/sqllogictest/test_files/tpch/plans/q9.slt.part (1 hunks)
  • datafusion/sqllogictest/test_files/union.slt (2 hunks)
  • datafusion/sqllogictest/test_files/window.slt (13 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Cursor Bugbot
  • GitHub Check: claude-review
  • GitHub Check: Analyze (rust)
🔇 Additional comments (79)
datafusion/sqllogictest/test_files/async_udf.slt (1)

83-87: Test expectation update looks correct.

The updated physical plan correctly reflects the removal of CoalesceBatchesExec wrapper around FilterExec as per the PR objective. The plan now starts with FilterExec at the top level, simplifying the execution hierarchy while preserving the filter logic and downstream operations. The CoalesceBatchesExec node is appropriately retained deeper in the plan (after AsyncFuncExec), where it still serves a purpose.

datafusion/sqllogictest/test_files/tpch/plans/q3.slt.part (1)

16-26: LGTM! Test expectations correctly updated to reflect the optimization change.

The physical plan expectations have been properly updated to show FilterExec operations appearing directly without CoalesceBatchesExec wrappers. The plan structure remains logically sound:

  • Filters on customer (line 16), orders (line 21), and lineitem (line 25) now execute directly before their respective data sources
  • CoalesceBatchesExec is appropriately retained around hash join operations (lines 19, 23) where batch coalescing provides value
  • The overall query semantics and execution flow are preserved
datafusion/sqllogictest/test_files/tpch/plans/q10.slt.part (1)

23-31: Test expectations correctly updated to reflect the optimization change.

The physical plan now shows FilterExec operators appearing directly without CoalesceBatchesExec wrappers (lines 23 and 27), which aligns with the PR objective of removing FilterExec from the CoalesceBatches optimization rule. The plan structure remains logically sound:

  • Filters are still properly applied after DataSourceExec (pushdown optimization)
  • CoalesceBatchesExec continues to wrap HashJoinExec operations (lines 76, 79-80, 83, etc.)
  • All predicates, projections, and join operations are preserved
datafusion/sqllogictest/test_files/tpch/plans/q13.slt.part (1)

72-73: Test expectation correctly updated to reflect optimizer change.

The physical plan now shows FilterExec directly without the CoalesceBatchesExec wrapper, which aligns with the PR objective to remove FilterExec from the CoalesceBatches optimization rule. The underlying filter logic, projections, and data source configuration remain functionally equivalent.

datafusion/sqllogictest/test_files/tpch/plans/q21.slt.part (1)

121-138: LGTM! Test expectations correctly updated to reflect optimizer changes.

The physical plan expectations have been properly updated to show FilterExec operations without the CoalesceBatchesExec wrapper, which aligns with the PR objective of removing FilterExec from the CoalesceBatches optimization rule. All four FilterExec instances (lines 121, 125, 129, 137) are consistently updated, while CoalesceBatchesExec is correctly retained around HashJoinExec operations throughout the plan.

datafusion/sqllogictest/test_files/tpch/plans/q16.slt.part (1)

23-30: LGTM! Test expectations correctly reflect the optimizer change.

The updated physical plan correctly shows FilterExec operators appearing directly in the execution tree without CoalesceBatchesExec wrappers, which aligns with the PR objective of removing FilterExec from the CoalesceBatches optimization rule. The plan structure remains logically equivalent, with all filter predicates and execution operators properly preserved.

datafusion/sqllogictest/test_files/regexp/regexp_like.slt (1)

254-255: Test expectation correctly updated to reflect optimizer change.

The physical plan now shows FilterExec directly without the CoalesceBatchesExec wrapper, which correctly reflects the PR's objective of removing FilterExec from the CoalesceBatches optimization rule. The test logic and dictionary optimization behavior remain unchanged.

datafusion/sqllogictest/test_files/tpch/plans/q5.slt.part (2)

100-102: LGTM! FilterExec now appears without CoalesceBatchesExec wrapper.

This change correctly implements the PR objective by removing the CoalesceBatchesExec wrapper around the FilterExec for the orders date range filter. The plan structure remains valid with the filter applied directly after repartitioning and before the data source.


103-115: LGTM! Consistent FilterExec optimization applied across the plan.

The changes correctly remove CoalesceBatchesExec wrappers around the region filter (line 113), while preserving coalescing around HashJoinExec operations (lines 103, 106, 109, 112). The plan structure is valid and maintains query semantics while implementing the intended optimization.

datafusion/sqllogictest/test_files/operator.slt (2)

290-318: LGTM: Test expectations correctly updated for optimizer change.

The physical plan expectations for "less than" operator tests have been consistently updated to reflect the removal of CoalesceBatchesExec wrapper around FilterExec. The filter predicates, type casting behavior, and data source details remain unchanged across all four test variants (positive integer, negative integer, decimal, negative decimal).


329-357: LGTM: Test expectations correctly updated for equality operator tests.

The physical plan expectations for equality operator tests have been consistently updated to reflect the removal of CoalesceBatchesExec wrapper around FilterExec. The filter predicates, type casting behavior, and data source details remain unchanged across all four test variants, maintaining symmetry with the less-than operator tests.

datafusion/sqllogictest/test_files/tpch/plans/q20.slt.part (1)

97-122: Plan expectation matches FilterExec unwrapping

The updated plan shows FilterExec directly after the repartition without a surrounding CoalesceBatchesExec, which is exactly what we expect after removing FilterExec from the coalesce optimization. The downstream operator sequence and projections remain consistent with the prior logical flow. Looks good.

datafusion/sqllogictest/test_files/tpch/plans/q7.slt.part (1)

32-49: Filter placement matches the optimizer change

Nice to see the FilterExec nodes now sitting directly atop their DataSourceExec inputs with the surrounding CoalesceBatchesExec wrappers gone—exactly what we expect after removing FilterExec from the coalesce rule, and the repartition topology stays intact. Looks good.

datafusion/sqllogictest/test_files/push_down_filter.slt (1)

43-44: Reject original review comment — premise is incorrect.

The original review incorrectly assumes FilterExec was removed from the CoalesceBatches optimization. In fact, the coalesce_batches.rs implementation only wraps HashJoinExec, certain RepartitionExec operators, and AsyncFuncExec—FilterExec is never included in the wrapping logic. FilterExec was never part of this optimization and therefore nothing was "removed."

The .slt test changes are simply updating expected execution plans to match the existing optimizer behavior. No underlying code change occurred to justify the premise that FilterExec exclusion was implemented.

Likely an incorrect or invalid review comment.

datafusion/sqllogictest/test_files/tpch/plans/q9.slt.part (1)

106-123: Test expectation correctly reflects the optimization change.

The physical plan update appropriately removes the CoalesceBatchesExec wrapper from around FilterExec (line 106), aligning with the PR objective to exclude FilterExec from the CoalesceBatches optimization rule. The FilterExec now appears directly in the plan, while CoalesceBatchesExec nodes remain strategically placed before hash joins and repartitioning operations (lines 104, 109, 112, 115, 118, 121). The logical plan is unchanged, confirming this is purely an execution plan optimization.

To ensure this optimization doesn't introduce performance regressions for queries with selective filters, consider running TPC-H benchmarks comparing query execution times before and after this change, particularly for queries with FilterExec operators on large tables.

datafusion/sqllogictest/test_files/qualify.slt (5)

276-278: LGTM: FilterExec wrapper removed as intended.

The physical plan correctly shows FilterExec appearing directly without a CoalesceBatchesExec wrapper, consistent with the PR objective. The execution flow (SortExec → FilterExec → WindowAggExec → DataSourceExec) remains logically sound.


292-296: LGTM: Simplified execution plan for partitioned window function.

The removal of the CoalesceBatchesExec wrapper around FilterExec (line 292) aligns with the PR objective. The plan correctly uses BoundedWindowAggExec for the partitioned row_number() function.


319-332: LGTM: Selective removal of CoalesceBatchesExec demonstrates correct optimization.

The plan correctly shows:

  • FilterExec operators at lines 320, 326, and 331 without CoalesceBatchesExec wrappers (aligned with PR objective)
  • CoalesceBatchesExec retained at lines 323 and 327, wrapping hash-partitioned RepartitionExec operations (correct behavior per the AI summary)

This demonstrates the targeted nature of the optimization change.


353-362: LGTM: Consistent FilterExec unwrapping in aggregate query plan.

The FilterExec at line 354 correctly appears without a CoalesceBatchesExec wrapper, while CoalesceBatchesExec is appropriately retained at line 359 for hash-partitioned aggregation. The plan structure properly handles the aggregate-then-filter-then-sort execution flow.


276-362: Manual verification required—sandbox cannot execute tests.

The sandbox environment lacks Rust compilation and runtime capabilities, so I cannot execute cargo test to verify the QUALIFY test suite. Since this is a test expectations file with query plan snapshots, runtime verification is essential to confirm:

  1. All QUALIFY queries still produce correct result rows (not just plan structure matches)
  2. Performance hasn't regressed with the optimization changes

Please manually run:

cargo test --test sqllogictest -- qualify.slt

to ensure all test assertions pass before merging.

datafusion/sqllogictest/test_files/tpch/plans/q8.slt.part (3)

129-131: Test expectations updated correctly for part table filtering.

The FilterExec operator now appears directly without a CoalesceBatchesExec wrapper, consistent with the PR objective. The filter predicate and projection remain correct.


138-141: Test expectations updated correctly for orders table filtering.

The FilterExec operator for the date range filter now appears without a CoalesceBatchesExec wrapper. The filter predicate correctly validates orders between 1995-01-01 and 1996-12-31.


151-155: Region table FilterExec placement is correct.

The filter predicate (r_name@1 = AMERICA) and projection (r_regionkey@0) are accurate. The execution plan structure with nested RepartitionExec operators is properly ordered.

datafusion/sqllogictest/test_files/count_star_rule.slt (1)

67-72: The review comment appears to be incorrect. FilterExec was never wrapped by the CoalesceBatches optimizer rule.

Examining the actual optimizer implementation, the wrap_in_coalesce condition only wraps three types of operators:

  1. HashJoinExec
  2. RepartitionExec (excluding RoundRobinBatch)
  3. The input to AsyncFuncExec

FilterExec is not in this list and has never been wrapped by CoalesceBatchesExec. The physical plan in the test shows CoalesceBatchesExec correctly wrapping the RepartitionExec portion downstream of the FilterExec, which is the expected and documented behavior.

The test update reflects the normal physical plan structure for a HAVING clause query—it is not evidence of removing any optimization. Without documentation that this is an intentional restructuring of how aggregates with HAVING clauses are planned, this appears to be a routine test expectation update unrelated to FilterExec optimization.

Likely an incorrect or invalid review comment.

datafusion/sqllogictest/test_files/limit.slt (1)

380-381: Test output correctly reflects the targeted optimization removing CoalesceBatchesExec around FilterExec following GlobalLimitExec.

The updated physical plan at lines 380-381 shows FilterExec appearing directly under GlobalLimitExec without the CoalesceBatchesExec wrapper, which aligns with the PR objective. Query result remains correct (still returns 1). The change is selective—CoalesceBatchesExec still wraps FilterExec in other optimization contexts elsewhere in the test suite (e.g., subquery.slt), confirming this is a targeted optimization rather than a wholesale removal.

datafusion/sqllogictest/test_files/repartition_scan.slt (2)

62-63: Test expectation correctly updated.

The physical plan now shows FilterExec directly followed by DataSourceExec without the CoalesceBatchesExec wrapper, correctly reflecting the optimization rule change.


62-63: The optimizer code changes are complete and verified.

The CoalesceBatches optimizer rule at datafusion/physical-optimizer/src/coalesce_batches.rs has been correctly updated. The wrap_in_coalesce condition (lines 63-66) now only wraps HashJoinExec and non-RoundRobin RepartitionExec operators. FilterExec is completely absent from the wrapping logic, which aligns with all test expectations across Parquet, CSV, and JSON file formats in the test file.

datafusion/sqllogictest/test_files/tpch/plans/q2.slt.part (3)

39-41: Second FilterExec also without CoalesceBatchesExec wrapper.

The filter on r_name consistently follows the same pattern as the first FilterExec, flowing directly to repartitioning. This demonstrates that the optimization rule change is being applied uniformly across all FilterExec operators in the plan.


28-72: CoalesceBatchesExec correctly preserved around HashJoinExec operations.

The optimization change selectively removes CoalesceBatchesExec around FilterExec while correctly preserving it around HashJoinExec operations (lines 28, 31, 34, 37, 42, 46, etc.). This is the expected behavior since joins can produce irregular batch sizes that benefit from coalescing, while filters typically don't alter batch structure significantly.


25-27: FilterExec operators now appear without CoalesceBatchesExec wrapper—verified in physical plan.

The file inspection confirms the changes are accurate. FilterExec operators at lines 25, 39, and 70 are no longer directly wrapped by CoalesceBatchesExec, while CoalesceBatchesExec remains around HashJoinExec nodes throughout the plan. This aligns with the PR objective.

Since automated test execution is not available in this environment, please manually verify that the SQL logic tests pass with these updated expectations to ensure query correctness is maintained.

datafusion/sqllogictest/test_files/tpch/plans/q17.slt.part (1)

67-75: Test expectation file correctly updated—no issues found.

The physical plan shows FilterExec is no longer directly wrapped in CoalesceBatchesExec. Batch coalescing is properly applied upstream at the RepartitionExec (line 12) and continues to wrap join operations and other repartitioning steps, which is the expected behavior from removing this specific optimization case.

datafusion/sqllogictest/test_files/tpch/plans/q22.slt.part (3)

93-95: Test expectation correctly updated to reflect CoalesceBatchesExec removal.

The FilterExec now appears directly under RepartitionExec without a CoalesceBatchesExec wrapper, which aligns with the PR objective. The change is consistent and the test expectation is accurate.

Note: Removing batch coalescing after filter operations may result in downstream operators receiving smaller batches, which could impact throughput in some scenarios. However, this appears to be an intentional design decision to simplify the execution plan.


99-104: Consistent removal of CoalesceBatchesExec from FilterExec in aggregate subquery.

The test expectation correctly shows FilterExec (line 102) appearing directly in the aggregation pipeline without a CoalesceBatchesExec wrapper. This change is consistent with the earlier FilterExec update (line 93) and maintains the overall query semantics while simplifying the execution plan structure.


77-104: Unable to fully verify the original review comment due to sandbox limitations.

While I confirmed that the FilterExec patterns are consistent across multiple TPC-H test files (q1, q15, q18, q22 all show FilterExec NOT wrapped by CoalesceBatchesExec), I cannot definitively determine:

  1. PR scope: Whether only q22.slt.part was modified or if all affected TPC-H test files were updated in this PR (git history not accessible in sandbox)
  2. Implementation verification: Whether the optimizer code change actually exists in the codebase (would require deep code inspection beyond sandbox capabilities)
  3. Test coverage: Whether unit tests exist for this optimization change
  4. Performance impact: Whether benchmarks have been run

The consistent pattern across sampled files suggests the optimization may be comprehensively applied, but this cannot be confirmed without access to git history and the implementation code.

Recommended manual checks:

  • Review the full git diff to confirm all intended test files were updated
  • Verify the optimizer implementation exists and is correctly gating this behavior
  • Confirm unit tests cover the CoalesceBatchesExec removal logic
  • Review performance benchmark results
datafusion/sqllogictest/test_files/pwmj.slt (1)

90-94: Filter placements look correct post-rule change.

The updated snapshots accurately reflect FilterExec sitting directly on the scans after dropping the CoalesceBatches wrapper. This aligns with the optimizer adjustment, and the expected plans stay coherent across all cases.

Also applies to: 135-139, 182-183, 229-230, 272-275

datafusion/sqllogictest/test_files/explain_tree.slt (1)

169-188: Filter plan updates match the intended optimizer change.

Nice work updating the tree explain expectations so the simple filter now shows FilterExec directly on top of RepartitionExec/DataSourceExec, reflecting the removal of the CoalesceBatchesExec wrapper. No issues spotted here.

datafusion/physical-optimizer/src/coalesce_batches.rs (1)

63-63: LGTM! FilterExec removed from coalescing logic.

The change correctly removes FilterExec from the list of operators that get wrapped with CoalesceBatchesExec, leaving only HashJoinExec (and special handling for RepartitionExec and AsyncFuncExec).

datafusion/sqllogictest/test_files/tpch/plans/q6.slt.part (1)

41-42: LGTM! Test expectations updated correctly.

The physical plan correctly reflects the removal of CoalesceBatchesExec wrapping around FilterExec. The plan now shows FilterExec directly followed by DataSourceExec, which is consistent with the optimizer change.

datafusion/sqllogictest/test_files/dictionary.slt (1)

413-414: LGTM! Test expectations updated correctly.

Physical plans throughout this file have been updated to remove CoalesceBatchesExec wrappers around FilterExec, showing the simplified plan structure consistently.

datafusion/sqllogictest/test_files/map.slt (1)

116-117: LGTM! Test expectations updated correctly.

The physical plan correctly shows FilterExec followed by DataSourceExec without the intermediate CoalesceBatchesExec node.

datafusion/sqllogictest/test_files/simplify_expr.slt (1)

29-30: LGTM! Test expectations updated correctly.

All physical plans in this file have been consistently updated to reflect the removal of CoalesceBatchesExec wrappers around FilterExec.

datafusion/sqllogictest/test_files/tpch/plans/q1.slt.part (1)

58-59: LGTM! Test expectations updated correctly.

The physical plan correctly shows FilterExec with its projection directly followed by DataSourceExec, removing the CoalesceBatchesExec wrapper.

datafusion/sqllogictest/test_files/filter_without_sort_exec.slt (1)

41-43: LGTM! Test expectations updated correctly.

All physical plans in this file have been consistently updated to show FilterExec directly in the plan without CoalesceBatchesExec wrappers, maintaining the correct execution semantics.

datafusion/sqllogictest/test_files/tpch/plans/q18.slt.part (1)

93-98: LGTM! Test expectations updated correctly.

The physical plan for this complex query (with subqueries and multiple joins) has been correctly updated to reflect the removal of CoalesceBatchesExec around FilterExec. The execution semantics remain unchanged.

datafusion/sqllogictest/test_files/select.slt (1)

1443-1445: LGTM! Test expectations updated for simplified physical plans.

The explain plan outputs have been consistently updated across multiple queries to reflect the removal of CoalesceBatchesExec wrappers around FilterExec. The new plans show FilterExec directly followed by RepartitionExec and DataSourceExec, which aligns with the PR objective to limit batching coalescing to HashJoinExec operations only.

Also applies to: 1463-1465, 1483-1485, 1503-1505, 1524-1526, 1571-1573

datafusion/sqllogictest/test_files/tpch/plans/q12.slt.part (1)

12-16: LGTM! TPC-H Q12 plan correctly updated.

The physical plan has been updated to remove CoalesceBatchesExec wrapping around the FilterExec on the lineitem side of the hash join (lines 12-13). The HashJoinExec itself still maintains CoalesceBatchesExec wrappers on both inputs (lines 10-11 and 14), which is consistent with the new optimization rule that applies batching coalescing only around join operations.

datafusion/sqllogictest/test_files/parquet_statistics.slt (1)

62-64: LGTM! Statistics test expectations consistently updated.

The explain plan outputs have been updated consistently across all three configuration blocks (default, collect_statistics=true, and collect_statistics=false) to show FilterExec as the first operation in the physical plan. The statistics reporting has been appropriately adjusted to reflect the new plan structure, with column statistics showing Min=Exact(Int64(1)) Max=Exact(Int64(1)) after the filter is applied.

Also applies to: 87-89, 113-115

datafusion/sqllogictest/test_files/tpch/plans/q19.slt.part (1)

9-15: LGTM! TPC-H Q19 plan correctly restructured.

The physical plan has been updated to remove CoalesceBatchesExec wrapping around the FilterExec on the lineitem side of the hash join (lines 9-10). The part side maintains its structure with appropriate batching before repartitioning (lines 11-12) and filtering (line 13). The HashJoinExec properly retains CoalesceBatchesExec wrapping (line 5), consistent with the optimization rule change.

datafusion/sqllogictest/test_files/parquet_filter_pushdown.slt (1)

98-100: LGTM! Parquet filter pushdown tests comprehensively updated.

The explain plan outputs have been consistently updated across multiple test scenarios to reflect the removal of CoalesceBatchesExec wrappers around FilterExec. Additionally, several plans now consolidate projection operations directly into the FilterExec node (e.g., projection=[a@0]) rather than using a separate ProjectionExec, which further simplifies the physical plans. The volatile predicate test (lines 412-413) is also correctly updated to show the same pattern.

Also applies to: 135-137, 266-268, 303-305, 341-343, 412-413

datafusion/sqllogictest/test_files/window.slt (7)

1888-1890: Plan update acknowledged

FilterExec now bypasses the extra CoalesceBatchesExec layer, which aligns with the optimizer change to stop wrapping filters. Looks good.


3545-3546: Plan update acknowledged

This fragment now shows the filter immediately on the scan path, matching the new optimization behavior. Looks good.


3599-3600: Plan update acknowledged

FilterExec remains directly on the scan branch here as well, consistent with the intended optimizer tweak. Looks good.


5304-5307: Plan update acknowledged

The hash repartition + filter ordering now reflects the removal of the CoalesceBatches wrapper around filters. Looks good.


5381-5388: Plan update acknowledged

The plan fragment properly shows both filters outside of any CoalesceBatchesExec, matching the optimizer change. Looks good.


5420-5425: Plan update acknowledged

This plan now keeps the disjunctive filter adjacent to the scan without extra coalescing, which is precisely what we expect. Looks good.


5517-5522: Plan update acknowledged

Filter placement relative to the repartition stages reflects the new rule removal and looks correct.

datafusion/sqllogictest/test_files/projection.slt (1)

279-281: LGTM! Test expectations correctly updated.

The physical plan expectations have been appropriately updated to reflect the removal of CoalesceBatchesExec wrapper around FilterExec. The execution plan structure remains logically sound with correct indentation levels, and the change aligns with the PR objective.

datafusion/sqllogictest/test_files/array.slt (6)

6439-6441: Plan expectation update looks correct (FilterExec unwrapped).

Physical plan now shows FilterExec directly over RepartitionExec/LazyMemoryExec, aligning with removal of CoalesceBatches around filters.


6467-6469: Consistent unwrapping of FilterExec.

Same shape as above; no CoalesceBatchesExec, order preserved: Filter -> Repartition -> Source.


6495-6497: EXPLAIN output matches intended optimizer change.

Direct FilterExec without CoalesceBatchesExec wrapper.


6523-6525: Good: CoalesceBatchesExec removed around FilterExec.

Plan structure consistent with other updated blocks.


6551-6553: Correct plan shape post-rule change.

FilterExec remains directly above Repartition/LazyMemory.


6581-6583: Verification complete: FilterExec unwrapping in scope achieved; scan confirms no problematic direct wrappings.

The provided scan results across all sqllogictest files show CoalesceBatchesExec nodes, but examination of the plan hierarchy reveals no direct wrapping of FilterExec by CoalesceBatchesExec in the sampled output. The expected plan structure (lines 6581–6583 in array.slt) correctly shows FilterExec appearing at a lower nesting level without CoalesceBatchesExec as its immediate parent. The PR objective has been achieved in the verified scope.

datafusion/sqllogictest/test_files/explain.slt (1)

46-48: Plan shape update looks correct

FilterExec before Repartition/DataSource with projection preserves semantics and improves early selectivity. LGTM.

datafusion/sqllogictest/test_files/repartition.slt (1)

124-127: Filter before repartition is desirable

Limit remains above filter (via CoalescePartitions), and pushing FilterExec below it is correct and efficient for streaming. LGTM.

datafusion/sqllogictest/test_files/parquet.slt (4)

457-460: Predicate pushdown preserved; coalesce removed

FilterExec sits above Repartition/DataSource, with predicate retained at DataSourceExec. Snapshot matches intent.


504-507: Same improvement replicated

Consistent FilterExec placement and predicate pushdown. LGTM.


554-557: Consistent plan simplification

No CoalesceBatchesExec around FilterExec; semantics unchanged. LGTM.


668-671: Starts_with pruning case looks good

FilterExec ordering correct; pruning_predicate remains in Parquet exec. LGTM.

datafusion/sqllogictest/test_files/join_is_not_distinct_from.slt (1)

156-160: Pre-join filter ordering is correct

FilterExec is applied before repartitioning inputs to the join; HashJoin remains elsewhere, maintaining coalescing where needed. LGTM.

datafusion/sqllogictest/test_files/aggregate.slt (2)

6025-6028: Limit-with-filter subtree aligns with rule change

FilterExec appears directly in the aggregation subtree; ordering with GlobalLimit remains correct. LGTM.


7167-7173: HAVING filter now explicit and projected

Explicit FilterExec on aggregate output with a precise projection is clearer and correct. LGTM.

datafusion/core/tests/dataframe/mod.rs (3)

769-771: Snapshots updated to show FilterExec without CoalesceBatchesExec: LGTM

These plan strings now place FilterExec directly under Aggregate/DataSource, consistent with removing FilterExec from the CoalesceBatches rule. Good alignment with the PR goal.

Also applies to: 817-819, 867-869


804-806: NULL/inlist plan fragments: LGTM

Physical plans now render FilterExec directly. Semantics unchanged; strings match expected display.

Also applies to: 814-816, 844-846


3329-3352: Review comment appears to be based on incorrect assumptions about rule scope

The grep results found 30+ instances of CoalesceBatchesExec wrapping FilterExec in replace_with_order_preserving_variants.rs, but these are from a different optimization rule than the scalar subquery changes shown in the code snapshot.

The coalesce_batches.rs rule (the one that introduces CoalesceBatchesExec) specifically wraps after HashJoinExec, RepartitionExec, and AsyncFuncExec—not FilterExec. The wrapping of FilterExec in the order-preserving variants tests appears to come from that separate optimization rule's logic, not from the scalar subquery changes.

Without evidence that the scalar subquery optimization rule change applies universally to all CoalesceBatchesExec→FilterExec patterns across the codebase, the suggested verification may be checking the wrong context. The snapshot updates in mod.rs (3329-3352, 3384-3407) are correct for scalar subquery optimization, but the matches in replace_with_order_preserving_variants.rs may be intentional for that separate rule.

Likely an incorrect or invalid review comment.

datafusion/sqllogictest/test_files/predicates.slt (2)

675-683: TPCH OR predicate plan fragment updated: LGTM

FilterExec appears explicitly in the physical plan, not under CoalesceBatchesExec. This matches the optimizer change; results are unchanged.


804-806: Binary operator with NULL and IN-list simplification: LGTM

Updated physical plans correctly show FilterExec directly above DataSourceExec; expectations stay consistent with logical-plan simplifications.

Also applies to: 814-816, 844-846

datafusion/core/tests/sql/explain_analyze.rs (2)

769-781: Display-indent snapshot reflects new FilterExec placement: LGTM

The FilterExec line now sits directly under the partial aggregate subtree with the same semantics. Snapshot matches the intended plan rendering.


1015-1023: Recursive parquet projection pushdown snapshot: LGTM

FilterExec appears without a CoalesceBatches wrapper; path normalization via TMP_DIR remains intact. Behavior and pruning predicates unchanged.

Comment on lines +120 to 123
06)------FilterExec: id@0 < 10
07)--------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
08)----------WorkTableExec: name=nodes

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Refresh batch-size commentary for recursive CTE plans.

These explain outputs used to show CoalesceBatchesExec: target_batch_size=2, which justified the nearby note “use explain to ensure batch size is set to 2.” After removing the Filter coalescing, that evidence disappeared, so the comment is now misleading. Please adjust the wording (or add another observable hook) so the test description reflects what the plan actually shows.

You could, for example, rewrite it to something like:

-# use explain to ensure that batch size is set to 2. This should produce multiple batches per iteration since the input
-# table 'balances' has 4 rows
+# With batch size set to 2 we still produce multiple batches per iteration; the plan no longer prints a Coalesce node, so we
+# rely on the configuration rather than the explain output to confirm the setting.

Also applies to: 164-167

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct that the comments in the snapshot tests need to be updated to reflect that FilterExec is no more wrapped inside CoalesceBatchesExec

Comment on lines +34 to +35
01)FilterExec: c0@0 < 1
02)--DataSourceExec: partitions=1, partition_sizes=[1]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Fix stale “expect coalesce” comments.

We dropped the CoalesceBatchesExec nodes from these plans, yet the surrounding comments still read “Expect coalesce …” / “expect batch size to be 1234.” That guidance is now misleading because nothing in the snapshot shows a coalesced batch size anymore. Please update (or remove) those remarks so the test explains match the new plan layout.

Apply this diff to keep the documentation truthful:

-# Expect coalesce and default batch size
+# Expect direct filtering without an explicit coalesce node (default batch size)

-# expect batch size to be 1234
+# coalescing is no longer shown; batch size stays at the configured 1234 internally

Also applies to: 74-75

🤖 Prompt for AI Agents
In datafusion/sqllogictest/test_files/options.slt around lines 34-35 (and
similarly update lines 74-75), the inline comments referencing "Expect coalesce"
and specific coalesced batch sizes are stale because CoalesceBatchesExec was
removed; remove or reword those comments to reflect the current plan (e.g., drop
any “Expect coalesce …” phrasing and any mention of a coalesced batch size, or
replace with a neutral note like “Expect no CoalesceBatchesExec; batch sizes
shown are raw from DataSourceExec”), ensuring the snapshot comments accurately
describe the shown plan.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct that the comments in the snapshot tests need to be updated to reflect that FilterExec is no more wrapped inside CoalesceBatchesExec

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants